Artificial data

First, generate data:

set.seed(0)
datobj = generate_data_generic(p=5, TT=300, fac=.5, nt=2000, dimdat = 3)
ylist = datobj$ylist
X = datobj$X

This produces three-dimensional cytograms ylist and a covariate matrix X.

ylist is a list of length \(T=300\), the number of time points (or cytograms). Each element of ylist is a matrix (a single cytogram) with \(n_t\) rows and \(d=3\) columns. The number of rows \(n_t\) can differ across the elements of ylist.

X is a \(T \times p\) matrix (here \(p=5\)), whose \(t\)’th row contains the relevant (environmental) covariates for the \(t\)’th cytogram.
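
To make the expected shapes concrete, here is a toy construction of both objects in base R (the sizes here are made up for illustration):

```r
## Toy construction of the expected data structures (sizes here are made up).
TT = 4   # number of time points (cytograms)
p  = 5   # number of covariates
d  = 3   # dimension of each particle
set.seed(1)

## ylist: a list of length TT; element t is an (n_t x d) matrix,
## and n_t may differ across time points.
nts = c(100, 120, 90, 110)
ylist_toy = lapply(nts, function(nt) matrix(rnorm(nt * d), ncol = d))

## X: a (TT x p) matrix; row t holds the covariates for cytogram t.
X_toy = matrix(rnorm(TT * p), nrow = TT)

stopifnot(length(ylist_toy) == TT,
          all(sapply(ylist_toy, ncol) == d),
          all(sapply(ylist_toy, nrow) == nts),
          dim(X_toy) == c(TT, p))
```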

plot(ylist[[1]][,1:2], ylab="", xlab="", pch=16, col=rgb(0,0,1,0.2), cex=.5)

Especially if your data form a time series, it is useful to also plot the covariates across time:

matplot(X, type = 'l')

Now, we estimate the model with fixed regularization parameters \(\lambda_\alpha=0.001\) and \(\lambda_\beta=0.01\), and \(K=4\) clusters.

Internally, flowmix() repeats the estimation nrep = 5 times (the default) and returns the estimated model with the best data fit.
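
This is the usual multiple-restart strategy for non-convex EM-type objectives. A minimal base-R sketch of the idea, using kmeans() as a stand-in for a single EM run (purely for illustration, not flowmix internals):

```r
## Multiple-restart sketch: fit several times from random initializations
## and keep the run with the best objective. kmeans() stands in for one
## EM run; kmeans minimizes within-cluster SS, while EM maximizes likelihood.
set.seed(0)
dat = matrix(rnorm(200 * 2), ncol = 2)
nrep = 5
fits = lapply(seq_len(nrep), function(irep) kmeans(dat, centers = 3))
objectives = sapply(fits, function(fit) fit$tot.withinss)
best = fits[[which.min(objectives)]]
stopifnot(best$tot.withinss == min(objectives))
```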

numclust = 4
set.seed(0)
res = flowmix(ylist = ylist, X = X, numclust = numclust,
              mean_lambda = 0.01, prob_lambda = 0.001,
              nrep = 5)
print(res)
## [1] "57 out of 60 regression coefficients for the cluster centers, are zero."
## [1] "18 out of 20 regression coefficients for the cluster probabilities, are zero."
## [1] "The algorithm converged in 16 EM iterations."
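
These totals are consistent with the problem sizes, assuming the printout counts one coefficient per covariate, per dimension, per cluster:

```r
p = 5; d = 3; K = 4
p * d * K  # 60 regression coefficients for the cluster centers
p * K      # 20 regression coefficients for the cluster probabilities
```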

The cluster probabilities look like this:

plot_prob(res)

The estimated cluster means along the three dimensions look like this:

par(mfrow = c(1,3), cex = 1.2)
for(idim in 1:3){
  res$mn[,idim,] %>% matplot(type = 'l',
                             lty = 1, ylab = paste0("Mean, dim=", idim))
}

The estimated cluster means, drawn over scatterplots of the data two dimensions at a time, look like this:

par(mfrow = c(1,3))
ylim = c(-3,8)
xlim = c(-5,8)
for(dims in list(c(1,2), c(2,3), c(3,1))){
  scatterplot_2d(ylist, res, 100, dims = dims, cex_fac=1, ylim=ylim, xlim=xlim)
}

Next, the first two dimensions only, shown at three different time points:

par(mfrow = c(1,3))
for(tt in c(1,50,200)){
  scatterplot_2d(ylist, res, tt,
                 dims = c(1,2),
                 cex_fac = 1,
                 ylim = ylim,
                 xlim = xlim)
  title(main = paste0("t=", tt, " out of ", res$TT))
}

Showing the model across time, in an animation:

par(mfrow = c(1,3), oma = c(2,2,2,2))
ylim = c(-3,8)
xlim = c(-5,8)
for(tt in 1:res$TT){
  for(dims in list(c(1,2), c(2,3), c(3,1))){
    scatterplot_2d(ylist, res, tt, dims = dims, cex_fac=1, ylim=ylim, xlim=xlim)
  }
  mtext(outer = TRUE,
        text = paste0("t=", tt, " out of ", res$TT),
        cex = 2)
}

Cross-validation

Obtain the maximum lambda values

The maximum values for the candidate regularization parameters \(\lambda_\alpha\) and \(\lambda_\beta\), to be used for cross-validation, can be numerically obtained:

destin = "."  ## Directory in which to save the results.
maxres = get_max_lambda(destin,
                        "maxres.Rdata",
                        ylist = ylist,
                        countslist = NULL,
                        X = X,
                        numclust = 4,
                        maxdev = 0.5,
                        max_mean_lambda = 40,
                        max_prob_lambda = 2)

Now setting a few things up.

## Define the locations to save the CV.
destin = "." 

## Define the CV folds (consecutive blocks of 20 time points; every nfold'th block goes to the same fold)
folds = make_cv_folds(ylist, nfold = 5, verbose = FALSE, blocksize = 20) 
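
For intuition, here is a hypothetical re-implementation of blocked folds in base R (an illustration of the idea, not make_cv_folds() itself):

```r
## Hypothetical sketch of blocked CV folds (NOT make_cv_folds() itself):
## split 1:TT into consecutive blocks of size `blocksize`, then assign
## the blocks to the nfold folds in a round-robin fashion.
TT = 300; nfold = 5; blocksize = 20
block_id = ceiling(seq_len(TT) / blocksize)   # block index of each time point
fold_id  = ((block_id - 1) %% nfold) + 1      # round-robin over the folds
folds_sketch = split(seq_len(TT), fold_id)
stopifnot(length(folds_sketch) == nfold,
          sum(lengths(folds_sketch)) == TT)   # the folds partition 1:TT
```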

## Define the candidate lambda values (logarithmically spaced)
cv_gridsize = 5
## maxres = list(alpha = 1, beta=1)
prob_lambdas =  logspace(min = 0.0001, max = maxres$alpha, length = cv_gridsize)
mean_lambdas = logspace(min = 0.0001, max = maxres$beta, length = cv_gridsize)
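
For reference, logspace() here just produces values equally spaced on the log scale; a drop-in base-R equivalent (a sketch, under the hypothetical name logspace_sketch) would be:

```r
## Base-R sketch of logspace(): `length` values equally spaced on the
## log scale between `min` and `max`, so consecutive ratios are constant.
logspace_sketch <- function(min, max, length){
  exp(seq(log(min), log(max), length.out = length))
}
lams = logspace_sketch(0.0001, 1, 5)
ratios = lams[-1] / lams[-length(lams)]
stopifnot(all(abs(ratios - ratios[1]) < 1e-8))
```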

One EM algorithm = one job

Next, what we call “one job” (a single call to one_job(ialpha, ibeta, ifold, irep)) runs the EM algorithm once, for:

  • the ialpha’th \(\lambda_\alpha\) value (out of prob_lambdas).
  • the ibeta’th \(\lambda_\beta\) value (out of mean_lambdas).
  • the ifold’th test fold out of the nfold=5 CV folds.
  • the irep’th repeat of the EM algorithm (nrep=10 in total).

## Example of one CV job for one pair of regularization parameters (and CV folds
## and EM replicates)
ialpha = 1
ibeta = 1
ifold = 1
irep = 1
destin = "~/Desktop"  ## Change to your target destination.
one_job(ialpha = ialpha,
        ibeta = ibeta,
        ifold = ifold,
        irep = irep,
        folds = folds,
        destin = destin,
        mean_lambdas = mean_lambdas, prob_lambdas = prob_lambdas,
        ## The rest that is needed explicitly for flowmix()
        ylist = ylist,
        countslist = NULL,
        X = X,
        numclust = 4,
        maxdev = 0.5,
        ## verbose = TRUE
        )

Also, for any given ialpha and ibeta, the nrep models estimated on the full data are obtained using one_job_refit():

## Example of one replicate of model estimation (in the full data) for one pair
## of regularization parameters.
ialpha = 1
ibeta = 1
irep = 1
destin = "~/Desktop"  ## Change to your target destination.
one_job_refit(ialpha = ialpha,
              ibeta = ibeta,
              irep = irep,
              destin = destin,
              mean_lambdas = mean_lambdas, prob_lambdas = prob_lambdas,
              ## The rest that is needed explicitly for flowmix()
              ylist = ylist,
              countslist = NULL,
              X = X,
              numclust = 4,
              maxdev = 0.5,
              ## verbose = TRUE
              )

(Since all of this is clearly parallelizable, it’s recommended to use multiple computers or servers for the full cross-validation.)
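
To see the size of the problem, the full job list can be enumerated with expand.grid(); the commented dispatch below is one hypothetical way to run the jobs in parallel on a single machine:

```r
## Enumerate every CV job: one row per (ialpha, ibeta, ifold, irep).
cv_gridsize = 5; nfold = 5; nrep = 10
jobs = expand.grid(ialpha = seq_len(cv_gridsize),
                   ibeta  = seq_len(cv_gridsize),
                   ifold  = seq_len(nfold),
                   irep   = seq_len(nrep))
stopifnot(nrow(jobs) == cv_gridsize^2 * nfold * nrep)  # 1250 independent jobs

## Each row is independent of the others, so (for instance) they could be
## dispatched with parallel::mclapply():
## parallel::mclapply(seq_len(nrow(jobs)), function(ii){
##   with(jobs[ii, ], one_job(ialpha, ibeta, ifold, irep, ...))
## }, mc.cores = 8)
```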

A single function

cv.flowmix() conducts the full cross-validation by wrapping most of the above into a single function.

(This code takes a long time to run, so it's recommended that you run it separately in a script; use the mc.cores option to run the jobs on multiple cores):

cvres = cv.flowmix(ylist = ylist,
                   countslist = NULL,
                   X = X,
                   maxdev = 0.5,
                   numclust = 4,
                   prob_lambdas = prob_lambdas,
                   mean_lambdas = mean_lambdas,
                   nrep = 5,
                   nfold = 5,
                   ## nrep = 1,
                   ## nfold = 2,
                   ## verbose = TRUE,
                   destin = "~/Desktop",
                   mc.cores = 8)

Then, the results are saved into separate files whose names follow these rules:

  • "1-1-1-1.Rdata" (i.e. ialpha-ibeta-irep-ifold.Rdata), from running the CV.
  • "1-1-1-cvres.Rdata", from estimating the model in the full data.
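
For reference, the filename for a given set of job indices can be reconstructed as follows (a small sketch of the naming rule above):

```r
## Reconstruct the saved filenames for job indices (ialpha, ibeta, irep, ifold).
ialpha = 1; ibeta = 1; irep = 1; ifold = 1
cv_file    = paste0(ialpha, "-", ibeta, "-", irep, "-", ifold, ".Rdata")
refit_file = paste0(ialpha, "-", ibeta, "-", irep, "-cvres.Rdata")
stopifnot(cv_file == "1-1-1-1.Rdata",
          refit_file == "1-1-1-cvres.Rdata")
```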

Then, the results are summarized with cv_summary(), and optionally saved to the file summary.RDS if save=TRUE:

cvres = cv_summary(destin = ".",
                   cv_gridsize = 5,
                   nrep = 5,
                   nfold = 5,
                   save = TRUE,
                   filename = "summary.RDS")

The model chosen by cross-validation is this:

cvres$bestres %>% print()

Binning data

If the data contains too many particles, we can reduce the size of ylist and instead deal with binned counts.

The resulting object countslist can additionally be passed to flowmix().

Here is an example. make_grid(ylist, gridsize=30) makes an equally spaced grid of size 30 over the data range, in each dimension. Then, bin_many_cytograms() places the particles in ylist into these bins. The resulting object is a list containing the bin centers ybin_list and the per-bin counts counts_list.

(Not used here, but you can optionally upweight each particle, e.g. by its biomass.)
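
To build intuition for what binning does, here is a self-contained one-dimensional sketch in base R (not bin_many_cytograms() itself, which bins in all \(d\) dimensions at once):

```r
## Self-contained 1-d sketch of binning (NOT bin_many_cytograms() itself):
## particles are replaced by equally spaced bin centers plus per-bin counts.
set.seed(0)
y = rnorm(1000)
gridsize = 30
breaks  = seq(min(y), max(y), length.out = gridsize + 1)
centers = (breaks[-1] + breaks[-(gridsize + 1)]) / 2   # bin midpoints
counts  = tabulate(cut(y, breaks, include.lowest = TRUE), nbins = gridsize)
stopifnot(sum(counts) == length(y))  # every particle falls into some bin
## The model then sees `centers` (in place of raw particles), weighted by `counts`.
```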

## Bin this data
grid = make_grid(ylist, gridsize = 30, grid.ind=FALSE)
obj = bin_many_cytograms(ylist, grid, mc.cores = 8, verbose=FALSE)  
ylist = obj$ybin_list
countslist = obj$counts_list

## Run the algorithm on binned data
res = flowmix(ylist = ylist,
              X = X,
              countslist = countslist,
              numclust = numclust,
              mean_lambda = 0.01,
              prob_lambda = 0.01,
              verbose = FALSE,
              maxdev = 0.5)

Real data

You can repeat the above code blocks with real data.

## Load data
load(file = "~/repos/flowmix/demo-MGL1704.Rdata")
X = X %>% select(-time, -lat, -lon) %>% as.matrix()
ylist = ybin_list
countslist = biomass_list

## Estimate model
library(flowmix)
set.seed(1)
res = flowmix(ylist, X, numclust = 10,
              countslist = countslist,
              mean_lambda = 0.001,
              prob_lambda = 0.01,
              maxdev = 0.5,
              nrep = 1,
              verbose = TRUE)
##  EM iterations.   1 out of 999 with lapsed time 0 seconds and remaining time 0 seconds and will finish at 2021-02-06 22:31:17
##  EM iterations.   2 out of 999 with lapsed time 5 seconds and remaining time 2492 seconds and will finish at 2021-02-06 23:12:55
##  EM iterations.   3 out of 999 with lapsed time 8 seconds and remaining time 2656 seconds and will finish at 2021-02-06 23:15:42
##  ... (progress lines for iterations 4 through 40 omitted)

Now visualizing the results as before.

## Default print
print(res)
## [1] "776 out of 1110 regression coefficients for the cluster centers, are zero."
## [1] "344 out of 370 regression coefficients for the cluster probabilities, are zero."
## [1] "The algorithm converged in 41 EM iterations."
## Plot estimated probabilities
plot_prob(res)

## Three scatterplots of one time point
par(mfrow = c(1,3))
ylim = c(-3,8)
xlim = c(-5,8)
dimnames = c("diam", "red", "orange")
for(dims in list(c(1,2), c(2,3), c(3,1))){
  scatterplot_2d(ylist = ylist,
                 countslist = countslist,
                 obj = res,
                 300,  ## the time point to plot
                 dims = dims, cex_fac=8,
                 pt_col = rgb(0 ,0, 1, 0.1),
                 xlab = dimnames[dims[1]],
                 ylab = dimnames[dims[2]])
}